Legacy Research Web Collections

   Critically Endangered

Research related collections of digital content on the web which are now outdated and/or no longer actively maintained. This can include software and published or unpublished source code.

Digital Species: Web, Research Outputs

Trend in 2023:

No change

Consensus Decision

Added to List: 2019

Trend in 2024:

No change

Previously: Critically Endangered

Imminence of Action

Action is recommended within twelve months, detailed assessment is a priority.

Significance of Loss

The loss of tools, data or services within this group would impact on people and sectors around the world.

Effort to Preserve | Inevitability

Loss seems likely. By the time tools or techniques have been developed, the material will likely have been lost.

Examples

Academic and institutional websites from the first decade of the web containing details of research projects and interests as well as research data.

‘Practically Extinct’ in the Presence of Aggravating Conditions

Inaccessible to web archive; bespoke code; insufficient documentation; uncertainty over IPR or the presence of orphaned works.

‘Endangered’ in the Presence of Good Practice

Secured by web archive; documentation and rights information published alongside material.

2023 Review

This entry was added in 2019. While there are overlaps with the ‘Semi-Published Research Data’ and ‘Unpublished Research Data’ entries, it is a separate entry to distinguish between ‘current’ and ‘legacy’ collections, which have different risk profiles. In 2020, the fact that materials in legacy web collections were no longer actively maintained increased the risk classification to Critically Endangered. The 2021 Jury agreed with these distinctions, adding that loss has already occurred and that future loss can be prevented through approaches such as web archiving and code preservation. They identified a 2021 trend toward greater risk, based on the security issues posed by hosting legacy technology, software and services, which were prompting imminent disposal of content without adequate review or selection. The 2022 Taskforce agreed with this assessment, noting no change to the trend (risks remained on the same basis as before).

The 2023 Council agreed with the Critically Endangered classification, with risks remaining on the same basis as before (‘No change’ to trend), but also noted a greater inevitability of loss compared to previous reviews. The Council also recommended that a nomination received for a new entry, on unpublished digital indices and transcriptions in the DIMEV Open-Access Digital Edition of the Index of Middle English Verse, would be more valuable as an example within this entry than as a new, standalone entry. Finally, the 2023 Council recommended that the next major review consider rescoping the entry, possibly splitting it into separate areas to assess the different levels of risk relating to published and unpublished source code in legacy research web collections.

2024 Interim Review

These risks remain on the same basis as before, with no significant trend towards even greater or reduced risk (‘No change’ to trend).

Additional Comments

These collections are valuable but lose funding and care as institutions reconfigure their tasks and as individuals step back from them due to retirement or, in the case of volunteers, old age.

There are countless legacy research web resources that people do not know about.

This is not necessarily a technical challenge but a resource challenge.

The Internet Archive and national web archiving bodies hold copies of many websites that would fit into this category, but by no means all. There is also a distinction between the data and the software or code used to deliver the user experience. Such code is secondary to the content.
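As a rough illustration of how coverage might be checked for a single site, the following minimal Python sketch queries the Internet Archive's public Wayback Machine availability endpoint and reports the closest snapshot, if one exists. The function names (`build_query`, `closest_snapshot`, `check`) are hypothetical helpers introduced here for illustration, and the response fields used are assumed to follow the endpoint's published JSON shape.

```python
import json
import urllib.parse
import urllib.request

# Public Wayback Machine availability endpoint (Internet Archive).
WAYBACK_API = "https://archive.org/wayback/available"


def build_query(url: str) -> str:
    """Return the availability request URL for a given site URL."""
    return WAYBACK_API + "?" + urllib.parse.urlencode({"url": url})


def closest_snapshot(response: dict):
    """Extract (timestamp, snapshot_url) from a decoded response, or None.

    Assumes the response carries an 'archived_snapshots' object whose
    optional 'closest' member describes the nearest capture.
    """
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest.get("timestamp"), closest.get("url")


def check(url: str, timeout: float = 10.0):
    """Query the endpoint over the network and return the closest snapshot."""
    with urllib.request.urlopen(build_query(url), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `check("https://www.dimev.net/")` would return either a timestamp and snapshot URL or `None`. A snapshot only shows that some capture exists: it does not show that the capture is complete, and underlying data or server-side code may not have been captured at all.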

This issue can be intensified where much of the content is hosted on legacy IT infrastructure, as security concerns may lead to imminent disposal of content. In these scenarios, action becomes more urgent given the security issues posed by hosting legacy technology and software.

Case Studies or Examples:

  • The British Library cyber incident illustrates the issues that can arise when working with legacy systems. See: The British Library (2024) ‘Learning Lessons from the Cyber-attack: British Library cyber incident review’, 8 March 2024. Available at: https://www.bl.uk/home/british-library-cyber-incident-review-8-march-2024.pdf/ [accessed 06 September 2024]

  • One example of an at-risk legacy research web collection, provided by the nominator of this entry, is the unpublished digital indices and transcriptions in the DIMEV Open-Access, Digital Edition of the Index of Middle English Verse. The index comprises transcriptions of Middle English text made by a research team, which were gathered as XML sheets and built upon a print publication: the Index of Middle English Verse (1943). These transcriptions involved significant financial and time investment, and many are of material unavailable online as digital facsimiles. It is uncertain how the data underlying the web resource is stored, whether it is held by a university, and whether it could easily be recovered. See Mooney, L., Mosser, D., Solopova, E., Thorpe, D., Hill Radcliffe, D., Hatfield, L., Cornelius, I. and Johnston, M. (n.d.) ‘The DIMEV: An Open-Access, Digital Edition of the Index of Middle English Verse’. Available at: https://www.dimev.net/ [accessed 24 October 2023]

  • The recovery of the VecNet archive of malaria-related publications offers another example, one with obvious public health implications. VecNet was founded in 2011 as a network of institutions assembled to address the concerns and recommendations of the Malaria Eradication Research Agenda initiative. It became a portal for malaria information and analysis tools, with the goal of extending present vector control interventions and enabling incorporation of additional interventions to achieve elimination. By 2019 an important component of the portal, the DataCite repository, had ceased to be available. However, the Vector-Borne Disease Network Data Warehouse (VecNet-DW), a project of departments of the University of Notre Dame and the Institute of Tropical Health and Medicine at James Cook University, retained the relevant data and is collaborating with Data Futures, which created the new Invenio repository. See Invenio (n.d.), ‘VecNet’. Available at: https://vecnet.nd.hasdai.org/ [accessed 24 October 2023].

  • Preserving the Carmichael Watson Research Project website at the University of Edinburgh: a case study on how this project website, online only from 2013 until 2018, came to be at imminent risk of permanent loss, and on the strategy undertaken to transform it into a more sustainable format through web archiving and to revive its public accessibility. See Day Thomas, S. and Hawes, A. (2021) ‘Using ArchiveWeb.page to capture the Carmichael Watson Project’, Web Archiving & Preservation Working Group - General Meeting December 2021. Available at: https://www.youtube.com/watch?v=0CWMwJn6p-w [accessed 24 October 2023]

  • Fellgett, M. (2021) ‘Secure your digital datasets — by letting a data centre look after them!’, British Geological Survey Blogs. Available at: https://www.bgs.ac.uk/news/secure-your-digital-datasets-by-letting-a-data-centre-look-after-them/ [accessed 24 October 2023]

